The dataset includes information about schools in the NYC district, focusing on variables like total enrollment, percentages of students with disabilities, English learners, and those living in poverty. By analyzing these metrics, I aim to uncover relationships and pose questions that could drive future data-driven strategies for improving equity and resource allocation across NYC schools.
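The dataset consists of 5 key columns: school_name, total_enrollment, percent_students_with_disabilities, percent_english_learners, and percent_poverty.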
Additionally, a derived column, poverty_level, was created to categorize schools into "Low" or "High" poverty based on their poverty percentage.
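The cutoff behind poverty_level is not stated, so the following is only a minimal sketch of how such a column could be derived. The 75% threshold and the helper name `add_poverty_level` are assumptions made for illustration; the sample values are the percent_poverty figures from the first five rows of the dataset.

```python
import numpy as np
import pandas as pd

# percent_poverty values from the first five rows shown by df.head()
sample = pd.DataFrame({"percent_poverty": [86.08, 83.70, 65.50, 83.88, 72.00]})

# ASSUMPTION: hypothetical cutoff; the notebook does not state the real one
POVERTY_CUTOFF = 75.0


def add_poverty_level(df: pd.DataFrame, cutoff: float = POVERTY_CUTOFF) -> pd.DataFrame:
    """Return a copy of df with a "Low"/"High" poverty_level column."""
    out = df.copy()
    # Schools at or above the cutoff are labeled "High", all others "Low"
    out["poverty_level"] = np.where(out["percent_poverty"] >= cutoff, "High", "Low")
    return out


labeled = add_poverty_level(sample)
```

Working on a copy keeps the original DataFrame unchanged, which makes it easy to try a different cutoff (for example the median poverty percentage) and compare the results.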
Describe any steps taken to clean and preprocess the data:
Document the reasoning for each cleaning step.
# Import libraries
import pandas as pd
import plotly.express as px
# Added so that plots render on my MacBook; only use the two lines below if you have rendering issues
import plotly.io as pio
pio.renderers.default = 'notebook'
# Read the CSV file into a pandas DataFrame
df = pd.read_csv("datasets/schools.csv")
df.head()
| | school_name | total_enrollment | percent_students_with_disabilities | percent_english_learners | percent_poverty |
|---|---|---|---|---|---|
| 0 | 47 The American Sign Language and English Seco... | 176.6 | 27.56 | 6.12 | 86.08 |
| 1 | A. Philip Randolph Campus High School | 1394.8 | 13.20 | 9.94 | 83.70 |
| 2 | A.C.E. Academy for Scholars at the Geraldine F... | 469.8 | 10.30 | 8.18 | 65.50 |
| 3 | ACORN Community High School | 391.0 | 27.08 | 4.16 | 83.88 |
| 4 | Abraham Lincoln High School | 2161.8 | 15.48 | 15.88 | 72.00 |
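Before summarizing the data, a few basic quality checks can inform which cleaning steps are needed. The sketch below is one possible approach, not the checks actually run for this analysis: the helper name `quality_report` is hypothetical, and the sample reuses the numeric values from the five rows above.

```python
import pandas as pd

# Numeric values from the first five rows shown by df.head()
sample = pd.DataFrame({
    "total_enrollment": [176.6, 1394.8, 469.8, 391.0, 2161.8],
    "percent_students_with_disabilities": [27.56, 13.20, 10.30, 27.08, 15.48],
    "percent_english_learners": [6.12, 9.94, 8.18, 4.16, 15.88],
    "percent_poverty": [86.08, 83.70, 65.50, 83.88, 72.00],
})


def quality_report(df: pd.DataFrame) -> dict:
    """Count missing values, duplicate rows, and percentages outside 0-100."""
    percent_cols = [c for c in df.columns if c.startswith("percent_")]
    return {
        "missing": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        # NaN comparisons evaluate to False, so missing values are not double counted here
        "out_of_range": {
            c: int(((df[c] < 0) | (df[c] > 100)).sum()) for c in percent_cols
        },
    }


report = quality_report(sample)
```

Running the same function on the full DataFrame would show whether any rows need to be dropped, imputed, or corrected, and the counts can be cited as the reasoning for each cleaning step.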
# To gain a clear understanding of the dataset's structure, I used df.describe() to generate summary statistics for the numeric variables
df.describe()
| | total_enrollment | percent_students_with_disabilities | percent_english_learners | percent_poverty |
|---|---|---|---|---|
| count | 1834.000000 | 1834.000000 | 1834.000000 | 1834.000000 |
| mean | 588.309481 | 21.944242 | 13.284084 | 75.030873 |
| std | 481.135549 | 15.793045 | 13.986404 | 19.179663 |
| min | 12.000000 | 0.000000 | 0.000000 | 3.950000 |
| 25% | 313.250000 | 14.963750 | 4.120000 | 68.065000 |
| 50% | 476.700000 | 19.340000 | 9.070000 | 79.680000 |
| 75% | 694.950000 | 24.415000 | 17.785000 | 88.580000 |
| max | 5591.800000 | 100.000000 | 99.600000 | 99.380000 |
For the three percentage columns, the mean and median differ by fewer than 5 percentage points, so either statistic gives a reasonable picture of the center, although maxima near 100% show that some extreme values do occur. Total enrollment is different. Its mean (about 588) is well above its median (about 477), and its standard deviation (about 481) is large relative to the mean. This suggests that a small number of very large schools are pulling the average up. The minimum (12) and maximum (about 5,592) confirm this, showing a wide range of enrollment sizes, likely with a few schools being outliers.
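The comparison of mean and median can be made explicit with a small calculation. This is a minimal sketch that takes the mean and 50% rows from the df.describe() output above; the helper name `mean_median_gap` is hypothetical.

```python
import pandas as pd

cols = [
    "total_enrollment",
    "percent_students_with_disabilities",
    "percent_english_learners",
    "percent_poverty",
]

# Mean and median (50%) rows copied from the df.describe() output
summary = pd.DataFrame(
    [
        [588.309481, 21.944242, 13.284084, 75.030873],
        [476.700000, 19.340000, 9.070000, 79.680000],
    ],
    index=["mean", "50%"],
    columns=cols,
)


def mean_median_gap(described: pd.DataFrame) -> pd.Series:
    """Mean minus median per column, in each column's own units."""
    return described.loc["mean"] - described.loc["50%"]


gap = mean_median_gap(summary)
print(gap.round(2))
```

A positive gap hints at a longer right tail and a negative gap at a longer left tail. The gaps are in each column's own units (students versus percentage points), so they are not directly comparable across columns.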
To better understand the distribution above, creating a histogram of total student enrollment in NYC public high schools will allow us to visualize the spread and confirm whether the data is skewed or clustered in specific ranges.
# Plot the data showing the total student enrollment
# Simple
# px.histogram(df, x="total_enrollment")
# Plot the data showing the total student enrollment
# With customizations
fig = px.histogram(
    df,
    x="total_enrollment",
    nbins=30,  # Maximum number of bins
    title="Distribution of Total Enrollment in NYC Public High Schools",
    labels={"total_enrollment": "Total Enrollment", "count": "Number of Schools"},
    color_discrete_sequence=["#636EFA"],
)
# Add mean and median as vertical lines
mean_enrollment = df["total_enrollment"].mean()
median_enrollment = df["total_enrollment"].median()
fig.add_vline(
    x=mean_enrollment,
    line_dash="dash",
    line_color="purple",
    annotation_text="Mean",
    annotation_position="top right",
    annotation_font_color="purple",
)
fig.add_vline(
    x=median_enrollment,
    line_dash="dot",
    line_color="red",
    annotation_text="Median",
    annotation_position="top left",
    annotation_font_color="red",
)
fig.update_layout(
    xaxis_title="Total Enrollment",
    yaxis_title="Number of Schools",
    title_font_size=18,
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
    margin=dict(l=40, r=40, t=60, b=40),  # Add margins
)
fig.show()
The total enrollment histogram shows a right-skewed distribution, with most schools having enrollments under 1,000 students. A significant proportion of schools have enrollments between 300 and 349 students, forming a clear cluster in this range.
There are several outliers in the data. The main curve of the distribution tapers off around 1,500 students, but a few schools exceed 2,000 or even 3,000 students. Notably, one school stands out with an enrollment of over 5,000 students.
These outliers in the tail of the distribution have a substantial impact on the mean, as it is sensitive to extreme values. In a right-skewed distribution like this, outliers pull the mean higher, making it less representative of the central tendency of the majority of schools. This highlights the importance of considering both the mean and median when analyzing data with significant skewness.
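The outliers above were identified visually. As a complementary check, one common convention (not the method used in this analysis) is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles reported by df.describe(); the function names are hypothetical.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, lower: float, upper: float) -> list:
    """Return the values that fall outside the fences."""
    return [v for v in values if v < lower or v > upper]


# 25% and 75% quartiles of total_enrollment from df.describe()
lower, upper = iqr_fences(313.25, 694.95)
print(round(lower, 2), round(upper, 2))  # -259.3 1267.5

# total_enrollment values from the first five rows shown by df.head()
head_enrollment = [176.6, 1394.8, 469.8, 391.0, 2161.8]
print(flag_outliers(head_enrollment, lower, upper))  # [1394.8, 2161.8]
```

Under this rule any school with more than roughly 1,268 students would be flagged, which is stricter than the visual taper at around 1,500. The lower fence is negative, so no school can be flagged as unusually small by this rule.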
Next, I was curious about the distribution of schools based on the proportion of students with disabilities, the percentage of English learners, and the percentage of students from low-income families, to identify any areas that need more resources or uncover any patterns.
# Visualization showing the % of students with disabilities in schools
Add insights of what your visualization reveals.
# Visualization showing the % of students who are learning English as a second language in schools
Add insights of what your visualization reveals.
# Visualization showing the % of students whose families are below the poverty line in schools
Add insights of what your visualization reveals.
# Scatter plot of percent_english_learners vs percent_poverty
Add insights from findings of scatter plot.
df.corr(numeric_only=True)
| | total_enrollment | percent_students_with_disabilities | percent_english_learners | percent_poverty |
|---|---|---|---|---|
| total_enrollment | 1.000000 | -0.176817 | 0.027771 | -0.143091 |
| percent_students_with_disabilities | -0.176817 | 1.000000 | 0.020820 | 0.079578 |
| percent_english_learners | 0.027771 | 0.020820 | 1.000000 | 0.315477 |
| percent_poverty | -0.143091 | 0.079578 | 0.315477 | 1.000000 |
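To read the matrix more easily, the unique variable pairs can be ranked by the absolute size of their correlation. The sketch below rebuilds the matrix from the values in the table above; the helper name `rank_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

cols = [
    "total_enrollment",
    "percent_students_with_disabilities",
    "percent_english_learners",
    "percent_poverty",
]

# Correlation values copied from the df.corr(numeric_only=True) output
corr = pd.DataFrame(
    [
        [1.000000, -0.176817, 0.027771, -0.143091],
        [-0.176817, 1.000000, 0.020820, 0.079578],
        [0.027771, 0.020820, 1.000000, 0.315477],
        [-0.143091, 0.079578, 0.315477, 1.000000],
    ],
    index=cols,
    columns=cols,
)


def rank_pairs(matrix: pd.DataFrame) -> pd.Series:
    """Return each unique variable pair, sorted by absolute correlation."""
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(matrix.shape, dtype=bool), k=1)
    pairs = matrix.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


ranked = rank_pairs(corr)
print(ranked)
```

The first entry is the pair percent_english_learners and percent_poverty at about 0.32, followed by total_enrollment with percent_students_with_disabilities at about -0.18. The remaining pairs are all below 0.15 in absolute value.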
*Add insights from the findings of df.corr().*
Add final thoughts and recommendations.